Directed acyclic graphs

Directed acyclic graphs are the brain child of computer scientist Judea Pearl, who developed them to reason systemically about causal inference. Judea Pearl’s The Book of Why provides an accessible introduction to directed acyclic graphs and makes for a nice holiday reading.

Let’s assume we want to generate a data set with the following variables:

Before we start simulating anything, we might want to think about the possible causal relationship between these variables. Drawing graphs is a good way to do this. Here are some possible scenarios:

library(dagitty) 
par(mfrow = c(1,3), mar=c(3,3,0,1))
dag1 = dagitty(
  "dag{
  D->B;
  B->C;
  D->C
  }")
coord.list = 
  list(
    x=c(D=0,B=2,C=1),
    y=c(D=0,B=0,C=1))
coordinates(dag1) = coord.list
drawdag(dag1, cex = 2)
dag2 = dagitty(
  "dag{
  D->B;
  D->C
  }")
coordinates(dag2) = coord.list
drawdag(dag2, cex = 2)

dag3 = dagitty(
  "dag{
  D->B;
  B->C
  }")
coordinates(dag3) = coord.list
drawdag(dag3, cex = 2)


These graphs are build from two types of components:

All graphs are made of nodes and edges. What is special about directed acyclic graphs is that

  1. the relationship between two variables has a direction, such that
    • variables at the start of the arrows are causes and
    • variables at the end are effects
  2. a succession of arrows must not form a circle, that is, a variable cannot cause itself, even indirectly1. For example, \(\small D\rightarrow B\rightarrow C\rightarrow D\) is not allowed.

To simulate data from a DAG, we have to decide what distributions the variables have. To keep things simple, we just use normally distributed data for now. The order in which we simulate variables is determined by the DAG. The first thing we do is to check which variables are exogenous to our model (DAG), because these have to be modeled first.

Let us focus on the left DAG, which we show here again:

drawdag(dag1, cex = 0.5, lwd = 1)

Here, \(D\) is the exogenous variable, so we simulate it first.

set.seed(12345)
N = 250
D = rnorm(N)


\(\small B\) and \(\small C\) are both endogenous variables, they are both determined by other variables in our model/DAG, but \(\small B\) is a parent of \(\small C\), so we have to simulate \(\small B\) next:

B = 1*D + rnorm(N)


Finally, we can calculate \(\small C\) as an effect of \(\small D\) and \(\small B\)

C = 1*D + 0.2*(B) + rnorm(N) 

A naive regression analysis

As a recap, here is the model for a simple linear regression that models \(\small C\) as a function of \(\small B\).

What Notation quap R-code
Likelihood \(C_i \sim Normal(\mu,\sigma)\) C ~ dnorm(mu, sigma)
Linear model \(\mu_i = \alpha + \beta B_i\) mu[i] <- a + b*B[i]
Prior \(\alpha \sim Normal(0,.5)\) alpha ~ dnorm(0, .5)
Prior \(\beta \sim Normal(0,2)\) beta ~ dnorm(0, 1)
Prior \(\sigma \sim Exponential(1)\) sigma ~ dunif(1)


Here we put the data together and estimate the model with quap:

coefkable = function(m,caption) {
  cap = paste("Coeffcients for model",caption)
  m %>% 
    precis %>% 
    as.matrix() %>% 
    kable(caption = cap, digits = 2) %>% 
    kable_styling(full_width = F)
}
d = data.frame(C = C, B = B, D = D)

m.C_B <-quap(
  alist(
    C ~ dnorm(mu,sigma),
    mu <- a + bB*B,
    a ~ dnorm(0,.5),
    bB ~ dnorm(0,1),
    sigma ~ dexp(1)),
  data=d)
# un-hide previous code chunk for the coefkable function
coefkable(m.C_B, "m.C_B")
Coeffcients for model m.C_B
mean sd 5.5% 94.5%
a 0.08 0.08 -0.04 0.21
bB 0.73 0.05 0.64 0.82
sigma 1.24 0.06 1.15 1.33

And we plot the associations implied by prior and posterior:

par(mfrow = c(1,2), mar=c(3,3,0,1))
prior = extract.prior(m.C_B)
mu = link(m.C_B,
          post=prior,
          data=list(B=c(-3,3)))
matplot(c(-3,3),
        t(mu[1:50,]),
        xlab = "B", ylab = "C", ylim = c(-4.5,4.5),
        type = "l", col = "blue", lty = 1)
title("Prior predictive samples")

B_seq = seq(from=-3,to=3,length.out=30)
mu = link(m.C_B,data=list(B=B_seq))
mu.mean <-apply(mu,2,mean)
mu.HPDI <-apply(mu,2,HPDI)
plot( C~B,data=d, cex = .5,
      col=adjustcolor("black", alpha = .2),
      ylim = c(-4.5,4.5),
      xlim = c(-3,3))
lines( B_seq,mu.mean,lwd=2, col = "blue")
shade( mu.HPDI, B_seq, col = adjustcolor("blue", alpha = .2))
title("Posterior predictive mean & HDPI")

So, if we disregard our knowledge about daily reading duration of parents, we conclude that there is a strong relationship between number of books in a household and child reading ability. If one would be willing to interpret the association as causal, one could even think that number of books in a household are an important cause of child reading ability.

Conditional independencies in DAGs

This example makes clear that we can come to wrong conclusions if our analysis model, here the simple linear regression \(\small C \sim Normal(\alpha+\beta B,\sigma)\) (the right DAG in the first figure), does not match the data generating process or model, which is \(\small C \sim Normal(\alpha+\beta_1 B + \beta_2 D,\sigma)\) (the left DAG in the first figure).

The question then is, how can we figure out what the correct data generating process (true DAG) is? Domain knowledge is obviously useful. In addition one can also use the fact that DAGs (can) imply conditional independencies. Two variables are independent conditional on a third variables when the first two variables are unrelated given that one knows the value of the third variable or fixes it to a specific value.

Let’s look at the left and right DAGs from above again:

par(mfrow = c(1,2), mar=c(3,3,0,1))
drawdag(dag1, cex = 0.5, lwd = 1)
drawdag(dag3, cex = 0.5, lwd = 1)

For the left DAG:

  • if we condition on \(\small D\), \(\small B\) and \(\small C\) are still dependent because of \(\small B \rightarrow C\)
  • if we condition on \(\small B\), \(\small D\) and \(\small C\) are still dependent because of \(\small D \rightarrow C\)
  • if we condition on \(\small C\), \(\small D\) and \(\small B\) are still dependent because of \(\small D \rightarrow B\)

So if the left DAG is correct, \(\small D\) and \(\small C\) are still related after we account for the relationship of \(\small B\) and \(\small C\).

Onto the right DAG:

  • if we condition on \(\small D\), \(\small B\) and \(\small C\) are still dependent because of \(\small B \rightarrow C\)
  • if we condition on \(\small B\), \(\small D\) and \(\small C\) are independent
  • if we condition on \(\small C\), \(\small D\) and \(\small B\) are still dependent because of \(\small D \rightarrow B\)

The DAG to the right has an implied conditional independence that we can check with our data.

Determining conditional dependencies gets harder when DAGs are more complex. In this case one can use software like the R package dagitty to determine conditional independencies as follows:

library(dagitty)
impliedConditionalIndependencies(dag1)


impliedConditionalIndependencies(dag2)
## B _||_ C | D

here B _||_ C | D means that \(\small B\) is independent of \(\small C\) conditional on \(\small D\), which you can also find written as \((\small B \perp\!\!\!\perp \small C \mid \small D)\).

impliedConditionalIndependencies(dag3)
## C _||_ D | B

Now that we have learned that under some DGPs there should be conditional independcies, how can we test them?

Remember that we defined conditional independence as independence given that we already know the third variable. One way to incorporate what we already know about a third variable is to use multiple (linear) regression.

Assuming that \(\small D\) is our outcome of interest, we can check if, in a model that uses both \(\small D\) and \(\small B\) to predict \(\small C\), \(\small B\) remains associated with \(\small C\) when we also account for the effect of \(\small D\). If it does, this speaks in favour of dag1. If it does not, it speaks in favour of dag3.

Multiple regression

Extending the linear model

To go from a simple to a multiple linear regression, we simply add a predictor variable to the linear predictor:

What Notation quap R-code
Simple linear model \(\mu_i = \alpha + \beta_B B_i\) mu[i] <- a + bB*B[i]
Multiple linear model \(\mu_i = \alpha + \beta_B B_i + \beta_D D_i\) mu[i] <- a + bB*B[i] + bD*D[i]

Accordingly, the quap model for the multiple linear regression looks a lot like the one for simple linear regression.

m.C_BD <-quap(
  alist(
    C ~ dnorm(mu,sigma),
    mu <- a + bB*B + bD*D,
    a ~ dnorm(0,0.2),
    bB ~ dnorm(0,1),
    bD ~ dnorm(0,1),
    sigma ~ dexp(1)),
  data=d)
coefkable(m.C_BD,"m.C_BD")
Coeffcients for model m.C_BD
mean sd 5.5% 94.5%
a 0.04 0.06 -0.06 0.15
bB 0.19 0.07 0.07 0.30
bD 0.97 0.10 0.81 1.13
sigma 1.06 0.05 0.98 1.13

Looking at the quap model, you can see that we use the same parameters for the priors for bD and bB. This makes sense because both variables have the same scale (variability). We might want to change this if the variables have different scales / variability.

Because this is a fairly simple model and less model checking is needed, we directly look at the regression coefficients from the first model m.C_B, which used only \(\small B\) as a predictor, and on the last model we estimated, m.C_BD,which used \(\small B\) and \(\small D\).

plot(coeftab(m.C_BD,m.C_B),par=c("bD","bB"), xlim = c(0,1.2))

First, we check the conditional independence that would justify to only use \(\small B\) as a predictor, namely that \(\small \beta_D\) or bD is (close to) zero in the coefficient from the m.C_BD model. The figure shows that this is clearly not the case and that therefore the m.C_B is not the correct model to estimate the effect of \(\small B\). Instead, we must use the m.C_BD model.

The other thing we can see is that the estimated association between number of books in the household and is weaker in the model m.C_BD, which uses an analysis consistent with the correct data generating process. This is an example of confounding bias, which is a more general term of which spurious association is a special case. If we omit a common cause of our predictor and outcome from the analysis, this leads to a biased estimation of the association between predictor and outcome. Also note that the regression parameters we estimated with the m.C_BD model are consistent with the values we use to generate the data above.

Mediation, direct and indirect effects

Did we now successfully estimate the effect of daily reading hours on child reading?

To answer this question, we first need to establish that we can estimate two types of effects

  • direct effects, effects that go only via the direct arrow from the predictor to the outcome
  • total effects, the sum of the direct effects and indirect effects that are “communicated” via one ore more mediator variables.

The m.C_BD model estimates the direct effect but not the total effect, because the coefficient for \(\small D\) is calculated while adjusting for the indirect effect via \(\small B\). If we want to estimate the total effect of \(\small D\), we have to model \(\small C\) only as an effect of \(\small D\).2

Here is the regression model:

m.C_D <-quap(
  alist(
    C ~ dnorm(mu,sigma),
    mu <- a + bD*D,
    a ~ dnorm(0,.5),
    bD ~ dnorm(0,1),
    sigma ~ dexp(1)),
  data=d)
coefkable(m.C_D,"m.C_D")
Coeffcients for model m.C_D
mean sd 5.5% 94.5%
a 0.05 0.07 -0.05 0.16
bD 1.17 0.06 1.07 1.27
sigma 1.07 0.05 1.00 1.15

Now we can plot the coefficients from the m.C_D and m.C_BD models together:

plot(coeftab(m.C_BD,m.C_D),par=c("bD"))

As expected, the total effect from the m.C_D model is larger than the direct effect from the m.C_BD model.

Think before you regress, and afterwards

One important consequence of thinking systematically about the processes that produced the data, and representing and analysing them with DAGs, is that what variables belong into a regression analysis depends on two things:

  • What is the (assumed) data generating process
  • What do we want to estimate

Depending on the answers to these questions, different variables have to be included into a regression.

One common mistake, which has the name Table 2 fallacy, where Table 2 is often the 2nd table in medical articles3, is to interpret multiple coefficients from a regression model as causal effects. This is in most cases not a valid approach.

Plotting to learn

Plotting against residuals

Plotting against residuals tell us what multiple regression does.

We discussed earlier that in multiple regression each regression coefficient captures the effect of a variable conditional on that one accounts for the effect of all other variables.

This means that we only look at the effect of the variation in that variable that is independent from the variation of the other variables. If we look for example at the two variables \(\small B\) and \(\small D\), these are correlated:

plot(D,B)

If we run a regression like \(\small C \sim Normal(\alpha + \beta B, \sigma)\) the coefficient \(\beta\) will capture the effect of the total variation of \(\small B\), which includes some variation due to \(\small D\). This becomes obvious if you remember that we simulated B = 1*D + rnorm(N).

One way to do a regression that returns the effect of \(\small B\) that is not due to \(\small D\), is to create a new variable \(\small B_R\) that does not include information that is also in \(\small D\). We can do this by regressing \(\small B \sim Normal(\alpha + \beta D, \sigma)\) and calculating \(\small B_R\) as the difference of the linear predictor (\(\small \mu_i\) or \(\small mu[i]\)) and \(\small B\).

m.B_D <-quap(
  alist(
    B ~ dnorm(mu,sigma),
    mu <- a+bD*D,
    a ~ dnorm(0,.5),
    bD ~ dnorm(0,1),
    sigma ~ dexp(1)),
  data=d)

The following figure shows observed and predicted values on the left, where vertical lines from observed values to the regression line are residuals. The right hand side shows the residuals plotted again \(\small D\). As expected these are uncorrelated, which means that \(\small B_r\) does not include any information that is also in \(\small D\).

# calculate the linear predictor mu
# for all observed data points
mu = colMeans(link(m.B_D,data=d))

# mu to plot regression lines
mu.l = colMeans(link(m.B_D,data=list(D = c(-3,3))))

par(mfrow = c(1,2), mar=c(3,3,0,1))

# select only a subset of values
# otherwise the plot gets very busy
idx = sample(nrow(d),75)

# plot raw data
plot(D[idx],B[idx], ylab = "B (estimated)", xlab = "D")

# plot regression line
lines(c(-3,3),mu.l, col = "blue")

# plot residual lines
for (i in idx) {
  lines(rep(D[i],2), c(B[i],mu[i]), col = "blue")
}

# calculate residuals
BR = B - mu
# plot residuals of BR against predictor D 
plot(D[idx],BR[idx], ylab = expression(B[R]), xlab = "D")

Here is an animation that shows how residuals capture only the variation in \(\small B\), after we have removed the effect of \(\small D\). In the animation, we do this by gradually reducing the slope and ending at a value of 0 for the slope, which indicates no influence of \(\small D\) on \(\small B\).

Understanding residuals

Now we can use \(\small B_R\) as a predictor of \(\small C\) and plot the regression line:

d$BR =  BR
m.C_BR <-quap(
  alist(
    C ~ dnorm(mu,sigma),
    mu <- a+bBR*BR,
    a ~ dnorm(0,.5),
    bBR ~ dnorm(0,1),
    sigma ~ dexp(1)),
  data=d)
coefkable(m.C_BR,"m.C_BR") 
Coeffcients for model m.C_BR
mean sd 5.5% 94.5%
a 0.20 0.10 0.03 0.36
bBR 0.19 0.11 0.01 0.36
sigma 1.62 0.07 1.50 1.74

Here are the regression lines for the model with \(\small B_R\) as the predictor (left) and with \(\small B\) as the predictor (right). As expected, the relationship is much stronger for \(\small B\). Also compare the regression coeffcients for \(\small B_R\) with the regression coeffcient for the \(\small B\) in the model m.C_BD. They are nearly identical.

par(mfrow = c(1,2), mar=c(3,3,0,1))

## Regression on residuals
BR_seq = seq(from=-3,to=3,length.out=30)
mu = link(m.C_BR,data=list(BR=BR_seq))
mu.mean <-apply(mu,2,mean)
mu.HPDI <-apply(mu,2,HPDI)
plot( C~BR,data=d, cex = .5,
      col=adjustcolor("black", alpha = .2),
      ylim = c(-4.5,4.5),
      xlim = c(-3,3),
      xlab = expression(B[R]))
lines( B_seq,mu.mean,lwd=2, col = "blue")
shade( mu.HPDI, B_seq, col = adjustcolor("blue", alpha = .2))

## Regression on original B
B_seq = seq(from=-3,to=3,length.out=30)
mu = link(m.C_B,data=list(B=B_seq))
mu.mean <-apply(mu,2,mean)
mu.HPDI <-apply(mu,2,HPDI)
plot( C~B,data=d, cex = .5,
      col=adjustcolor("black", alpha = .2),
      ylim = c(-4.5,4.5),
      xlim = c(-3,3))
lines( B_seq,mu.mean,lwd=2, col = "blue")
shade( mu.HPDI, B_seq, col = adjustcolor("blue", alpha = .2))

Posterior predictions

Posterior predictions tell us if the model is any good.

If our model capture the data well, the observed and model-predicted data should be correlated. We can visualize this correlation in a scatter plot. Because we do not expect that the predictions exactly coincide with the observed values, it is useful to add credible intervals or HDPIs to the predicted values. Then, one can check if those overlap with the regression line that visualizes perfect coincidence of observed and predicted data.

# no newdata, so uses original data
mu = link(m.C_BD)
# summarize samples across cases
mu_mean = apply(mu,2,mean)
mu_PI = apply(mu,2,PI) # credible intervals
# simulate observations
# again no newdata, so uses original data
C_sim = rethinking::sim(m.C_BD,n=1e3)
C_PI = apply(C_sim,2,PI)  # prediction intervals

When we add HDPIs or credible intervals, we can calculate those for the linear predictions, the right plot in the next figure, or for posterior predictions, the left plot in the next figure. The difference in intervals for linear and posterior predictions is that the latter also captures uncertainty about predictions for each individual (which depends on the standard deviation \(\sigma\) of the Gaussian likelihood), whereas the former only captures uncertainty about the slope of the regression line, i.e. \(\beta\).

If the main concern is if the observed data are within the most plausible predicted data, one should use credible intervals from posterior predictions. If one has for example a 90% credible interval, 90% of the observed values should be within the credible interval of the predicted values.

If the main concern is to detect systematic or strong deviations between observed and predicted data, these can be easier seen if one uses credible intervals from the linear predictions.

par(mfrow = c(1,2), mar=c(3,3,0,1))
## posterior predictions + prediction intervals
plot(mu_mean[idx]~-d$C[idx],
     col="blue", cex = .5,
     xlab="Observed child reading",
     ylab="Predicted child reading",
     ylim = range(C_PI[,idx]))
# regression line for perfect coincidence
# of observed and predicted values
abline( a=0,b=1,lty=2)
# add credible intervals to predictions
for (i in idx) 
  lines(rep(d$C[i],2),
        C_PI[,i],
        col="blue")
title("posterior predictions + prediction intervals")

## posterior predictions + credible intervals
plot(mu_mean[idx]~-d$C[idx],
     col="blue", cex = .5,
     xlab="Observed child reading",
     ylab="Predicted child reading",
     ylim = range(C_PI[,idx]))
# regression line for perfect coincidence
# of observed and predicted values
abline( a=0,b=1,lty=2)
# add credible intervals to predictions
for (i in idx) 
  lines(rep(d$C[i],2),
        mu_PI[,i],
        col="blue")
title("posterior predictions + credible intervals")

In this example, there are no strong deviations between predicted and observed values. After all, the regression is based on the true model. Here are the same types of plots if the true DGP is a non-linear effect \(\small D\) on \(\small C\), but our model assumes only a linear relationship:

d2 = d
d2$C = d2$D - .5* d2$D^2 + .2*d2$B + rnorm(N)
m.C_BD2 <-quap(
  alist(
    C ~ dnorm(mu,sigma),
    mu <- a + bD*D + bB*B,
    a ~ dnorm(0,0.2),
    bD ~ dnorm(0,1),
    bB ~ dnorm(0,1),
    sigma ~ dexp(1)),
  data=d2)
mu = link(m.C_BD2)
mu_mean = apply(mu,2,mean)
mu_PI = apply(mu,2,PI) 
C_sim = rethinking::sim(m.C_BD2,n=1e3)
C_PI = apply(C_sim,2,PI)

par(mfrow = c(1,2), mar=c(3,3,0,1))
## posterior predictions + prediction intervals
plot(mu_mean[idx]~-d2$C[idx],
     col="blue", cex = .5,
     xlab="Observed child reading",
     ylab="Predicted child reading",
     ylim = range(C_PI[,idx]))

# regression line for perfect coincidence
# of observed and predicted values
abline( a=0,b=1,lty=2)
# add credible intervals to predictions
for (i in idx) 
  lines(rep(d2$C[i],2),
        C_PI[,i],
        col="blue")
title("posterior predictions + prediction intervals")

## posterior predictions + credible intervals
plot(mu_mean[idx]~-d2$C[idx],
     col="blue", cex = .5,
     xlab="Observed child reading",
     ylab="Predicted child reading",
     ylim = range(C_PI[,idx]))
# regression line for perfect coincidence
# of observed and predicted values
abline( a=0,b=1,lty=2)
# add credible intervals to predictions
for (i in idx) 
  lines(rep(d2$C[i],2),
        mu_PI[,i],
        col="blue")
title("posterior predictions + credible intervals")

Here, we see that the model systematically overestimates low values of child reading ability.

Counterfactual plots

Counterfactual plots tell us about the effect of an intervention, under the condition that we assume or have identified the correct data generating process our causal model (which may be represented as a DAG).

Let’s assume you are thinking about an intervention to improve child reading ability. You know that parental daily reading has a stronger effect than the number of books in the household. On the other hand, it is maybe easier to convince parents to accept some book gifts compared to making them read more. So you want to know by how much you can expect child reading to change if you increase either daily reading hours or number of books in a household.

This can be calculated with counterfactuals, so called because they calculate outcomes under conditions that were actually not observed. Counterfactual outcomes are computed according to following rules:

  • Set the manipulated variable to chosen values (some of which were not realized in the data). This makes the manipulated variable independent of all other variables in the DAG/model, it makes it “exogenous”.
  • Using coefficients from you model(s), calculate all down stream effects of the manipulated variable on the outcome. This includes direct and indirect effects.

Here is as a refresher our DAG:

drawdag(dag1, cex = 0.5, lwd = 1)

If we want to manipulate \(\small D\) and examine the effect on \(\small C\), we have to calculate the resulting direct effect \(\small D \rightarrow C\) and the indirect effect \(\small D \rightarrow B \rightarrow C\). In order to do this, we need regression coefficients for all these paths. With quap we can estimate them in one model:

m.BC <-quap(
  alist(
    ## D -> C <- B
    C ~ dnorm(mu,sigma),
    mu <- a + bB*B + bD*D,
    a ~ dnorm(0,0.2),
    bD ~ dnorm(0,0.5),
    bB ~ dnorm(0,0.5),
    sigma ~ dexp(1),
    ## D -> B
    B ~ dnorm(mu_B,sigma_B),
    mu_B <- aB + bDB*D,
    aB ~ dnorm(0,0.2),
    bDB ~ dnorm(0,0.5),
    sigma_B ~ dexp(1)),
  data=d)

Next we specify a sequence of values for \(\small D\), for which we would like to estimate \(\small C\):

D_seq  = seq(from=-2, to=2, length.out=10)

We can use the sim function from the rethinking package to simulate first values of \(\small D\) and then, using the just simulate \(\small D\) values, also \(\small C\) values.

The vars argument to sim tells it both which observables to simulate and in which order.

# prepare the data
sim_dat = data.frame(D = D_seq)
# simulate B and then C,using D_seq
s  = rethinking::sim(m.BC,
         data = sim_dat,
         vars = c("B", "C"))
PIs = apply(s$C,2,PI)
plot(sim_dat$D,colMeans(s$C),
      type="l", col = "blue",
      xlab="manipulated D",
     ylab="counterfactual C",
     ylim = range(PIs))
shade(PIs,
       sim_dat$D,
      col = adjustcolor("blue",alpha = .2))
mtext( "Total counterfactual effect of D on C")

This is a large effect. Next we calculate by how many sd child reading gets better if we go from increase daily reading of parents from 0 to 1 (which is also a 1 sd increase because the sd of parental reading is 1.).

sim2_dat = data.frame(D = c(0,1))
s2 = rethinking::sim(m.BC,
                     data=sim2_dat,
                     vars=c("B","C"))
mean( s2$C[,2]-s2$C[,1])
## [1] 1.149065

Now we look at counterfactuals for \(\small B\). We calculate the effect of \(\small B\) while we fix \(\small D\) to zero.4

sim_datB = data.frame(
  B=seq(from=-2,to=2,length.out=10),
  D=0)
sB  = rethinking::sim(m.BC,
                      data=sim_datB,
                      vars="C")

And we plot the simulated data, with the counterfactual effect of \(\small D\) on the left side and of \(\small B\) on the right side.

par(mfrow = c(1,2), mar=c(3,3,0,1))
plot(sim_dat$D,colMeans(s$C),
      ylim = range(PIs), type="l", col = "blue",
      xlab="manipulated D",
     ylab="counterfactual C")
shade(apply(s$C,2,PI),
       sim_dat$D,
      col = adjustcolor("blue",alpha = .2))
mtext( "Total counterfactual effect of D on C")


plot(sim_datB$B,colMeans(sB),
     ylim = range(PIs), type="l", col = "blue",
    xlab="manipulated B",
     ylab="counterfactual C")
shade( apply(sB,2,PI),
       sim_datB$B,
       col = adjustcolor("blue",alpha = .2))
mtext( "Total counterfactual effect of B on C")

As expected, the effect of \(\small B\) is much weaker.

simB2_dat = data.frame(B=c(0,1), D = 0)
s2 = rethinking::sim(m.BC,
                     data=simB2_dat,
                     vars=c("C"))
mean(s2[,2]-s2[,1])
## [1] 0.1772927

Unmasking “hidden” relationships

May effects depend on multiple causes and our previous example on the effect o parental reading on child reading has shown that we need to be careful about confounders, common caused of the treatment of interest and the outcome, in order to avoid overestimating treatment effects.

In this section, we examine how ignoring concurrent causes can lead to underestimating true effects. As an example, we can think about the effect decision authority (\(\small A\)) and stress (\(\small S\)) on job satisfaction (\(\small W\)). What can we hypothesize about the relationship of these variables?

Here is the relationship between the three variables summarized in one DAG.

dag1 = dagitty(
  "dag{
  A->S;
  S->W;
  A->W
  }")
coord.list = 
  list(
    x=c(A=0,S=2,W=1),
    y=c(A=0,S=0,W=1))
coordinates(dag1) = coord.list
drawdag(dag1, cex = 0.5, lwd = 1)

Based on this DAG, we can simulate data:

set.seed(1234)
N = 75
A = rnorm(N)
S = A*.5 + rnorm(N)
W = 0.3*A - 0.3*S + rnorm(N,sd = 1)
d = data.frame(A = A, S = S, W = W)

We use a pairs plot to get a first impression of the data.

pairs(d)

This plot does not show a clear relationship between stress and satisfaction! The reason is that decision authority and stress are positively related, but also have opposite effects on job satisfaction.

A naive analysis

A naive analysis would for example simply assume that decision authority and stress are independent variables, that each influence job satisfaction. If this were true, one could estimate the effects with simple linear regressions. Here is the code for two simple quap models that do this:

m.A <-quap(
  alist(
    W ~ dnorm(mu,sigma),
    mu <- a + bA*A,
    a ~ dnorm(0,.5),
    bA ~ dnorm(0,1),
    sigma ~ dexp(1)),
  data=d)
m.S <-quap(
  alist(
    W ~ dnorm(mu,sigma),
    mu <- a + bS*S,
    a ~ dnorm(0,.5),
    bS ~ dnorm(0,1),
    sigma ~ dexp(1)),
  data=d)

Next, we plot the regression line with credible intervals to visualize how strong the estimated association is:

A_seq = seq(min(d$A), max(d$A), length.out = 50)
S_seq = seq(min(d$S), max(d$S), length.out = 50)

par(mfrow = c(1,2), mar=c(3,3,0,1))

mu = link(m.A,data=list(A=A_seq))
mu.mean <-apply(mu,2,mean)
mu.PI <-apply(mu,2,PI)
plot( mu.mean~A_seq,data=d,'l',col = "blue",
      ylim = c(-2,2),
      ylab = "W", xlab = "A")
shade( mu.PI, A_seq, col = adjustcolor("blue", alpha = .2))

mu = link(m.S,data=list(S=S_seq))
mu.mean <-apply(mu,2,mean)
mu.PI <-apply(mu,2,PI)
plot( mu.mean~S_seq,data=d,'l', col = "blue",
      ylim = c(-2,2),
      ylab = "W", xlab = "S")
shade( mu.PI, S_seq, col = adjustcolor("blue", alpha = .2))

Let’s check again if the model we used was reasonable. Given that we assumed that \(\small A\) and \(\small S\) are independent, this is that DAG that would justify the naive analysis:

naive.dag = dagitty(
  "dag{
  S->W;
  A->W
  }")
coord.list = 
  list(
    x=c(A=0,S=2,W=1),
    y=c(A=0,S=0,W=1))
coordinates(naive.dag) = coord.list
drawdag(naive.dag, cex = 0.5, lwd = 1)

And here are the implied conditional independencies of this DAG:

impliedConditionalIndependencies(naive.dag)
## A _||_ S

A short look at the pairs plot above shows us that \(\small A\) and \(\small S\) are not independent.

But we already new that this DAG was not correct. Now lets check the conditional independencies of the true DAG:

impliedConditionalIndependencies(dag1)

Indeed, this DAG does not have any conditional independencies. Another property of this DAG is that it is just one of many DAGs with the same skeleton, i.e. nodes and undirected connections, that can produce the same dependencies between variables. We can extract and plot all these dags, which belong in a markov equivalence class, wit a few short commands:

ME.dags <-equivalentDAGs(dag1) # equivalentDAGs is part of dagitty
drawdag(ME.dags)
mtext("(1)", outer = T, line = -17, adj = 0+.025,cex = .75)
mtext("(2)", outer = T, line = -17, adj = 1/3+.025,cex = .75)
mtext("(3)", outer = T, line = -17, adj = 2/3+.025,cex = .75)
mtext("(4)", outer = T, line = -36, adj = 0+.025,cex = .75)
mtext("(5)", outer = T, line = -36, adj = 1/3+.025,cex = .75)
mtext("(6)", outer = T, line = -36, adj = 2/3+.025,cex = .75)

The important insight here is that if we only observe a data set where \(\small A\), \(\small S\), and \(\small W\) are connected and no conditional independencies are present, we cannot infer which of the 6 DAGs shown in the previous figure generated the data. If we still want to make some progress, we need use domain knowledge to rule out DAGs. For instance, if there is certain domain knowledge that stress influences job satisfaction and not the other way around, DAGs (2), (3), and (6) can be excluded, and if one is certain that stress does not cause decision autonomy, DAGs (4), (5) and (6) can be excluded.

The key message here is that it is necessary to postulate a causal model, e.g. in form of a DAG, in order to decide over an analysis approach. If possible, we can use the data at hand to choose the correct, but more often than not domain knowledge is at least as important.

An informed analysis

If we want to estimate the direct effect of \(\small A\) given this causal model:

drawdag(dag1, cex = 0.5, lwd = 1)

we need to investigate how \(\small W\) varies when we vary \(\small A\) and hold \(\small S\) constant. To do this, we can again use counterfactuals, because we compute expected job satisfaction at combinations of \(\small A\) and \(\small W\) values that were not necessarily observed.

To prepare the calculation of counterfactuals, we first need to estimate job satisfaction as an effect of \(\small A\) and \(\small S\):

m.AS <-quap(
  alist(
    W ~ dnorm(mu,sigma),
    mu <- a + bA*A + bS*S,
    a ~ dnorm(0,.5),
    bA ~ dnorm(0,1),
    bS ~ dnorm(0,1),
    sigma ~ dexp(1)),
  data=d)

Did the coefficients for \(\small A\) and \(\small S\) change? Here is a plot of the coeffcients from the diffrent models:

plot( coeftab(m.A,m.S,m.AS),pars=c("bA","bS"))

Next we use the link function from the rethinking package to calculate values \(\small W\) for vary \(\small A\) while we fix \(\small S = 0\). We plot counterfactual effect with a solid line and add for comparison the effect from the unadjusted model (m.A) with a dotted line.

par(mar=c(3,3,0,1), mgp=c(1,.5,0), tck=-.01)
xseq = seq(from=min(d$A)-0.15, to=max(d$A)+0.15, length.out=50)
mu = link(m.AS,data=data.frame(A=xseq,S=0))
mu.mean <-apply(mu,2,mean)
mu.PI <-apply(mu,2,PI)
plot( mu.mean~xseq,data=d,'l', col = "blue",
      ylim = c(-2,2),
      ylab = "W", xlab = "A")
lines(x = xseq, y = coef(m.A)[1] + coef(m.A)[2]*xseq, col = "blue", lty = 3)
shade( mu.PI, xseq, col = adjustcolor("blue", alpha = .2))

Of course, the model is still consistent with an effect of stress. To show this, we can plot the counterfactual effect of \(\small A\) for different levels of \(\small S\). The plot uses 50% credible intervals.

par(mar=c(3,3,0,1), mgp=c(1,.5,0), tck=-.01)
plot(0,type = "n", ylim = c(-2.5,2.4), xlim = c(-2.85,max(xseq)), ylab = "W", xlab = "A")

for (s in c(-2,0,2)) {
  mu = link(m.AS,data=data.frame(A=xseq,S=s))
  mu.mean <-apply(mu,2,mean)
  mu.PI <-apply(mu,2,PI, prob = .5)
  lines( mu.mean~xseq,data=d,'l', col = "blue",
         ylim = c(-2,2),
         ylab = "W", xlab = "S")
  shade( mu.PI, xseq, col = adjustcolor("blue", alpha = .2))
  text(min(xseq),mu.mean[1], paste0("S = ",s), pos = 2)
}


  1. If one has multiple measurements of the same variable over time, each measurement would be a new node↩︎

  2. Note that this is true for the current model. If there would be another confounder, i.e. a common cause for \(\small B\) and \(\small C\), we would have to include it into the regression.↩︎

  3. Table 1 describes the sample↩︎

  4. Fixing \(\small D\) to a specific value is OK here because this is a simple linear regression without interactions. Things get more complicated when interactions play a role or non-linear link functions are used, like e.g. for logistic regressions.↩︎